
Tracing fix #128

Merged: 11 commits, Jan 31, 2024
Conversation

@sarsko (Contributor) commented Nov 27, 2023

I want to add a bit more testing. At least something fairly minimal, like connecting the tests to a tracing_subscriber::fmt which writes somewhere other than stdout and then checking that output, and a simple "X threads enter and exit Y spans and assert that the current span is what they expect" test.

Opening a PR in order to get feedback and suggestions. Feel free to suggest alternative test ideas. It would be nice to do a programmatic check against Tokio / no runtime as well.
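
For the "write somewhere other than stdout and check on it" idea, here is a minimal sketch of what that could look like. The BufWriter type, the test body, and the assertion are hypothetical; only tracing_subscriber's MakeWriter plumbing and shuttle::check_random are the real APIs:

use std::io::Write;
use std::sync::{Arc, Mutex};
use tracing_subscriber::fmt::MakeWriter;

// A clonable writer that appends everything to a shared buffer
// instead of stdout, so the test can assert on the captured traces.
#[derive(Clone, Default)]
struct BufWriter(Arc<Mutex<Vec<u8>>>);

impl Write for BufWriter {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

impl<'a> MakeWriter<'a> for BufWriter {
    type Writer = BufWriter;
    fn make_writer(&'a self) -> Self::Writer {
        self.clone()
    }
}

#[test]
fn traces_go_to_buffer() {
    let buf = BufWriter::default();
    let _ = tracing_subscriber::fmt().with_writer(buf.clone()).try_init();
    shuttle::check_random(|| { /* hypothetical test body */ }, 1);
    let output = String::from_utf8(buf.0.lock().unwrap().clone()).unwrap();
    assert!(output.contains("step")); // check on the captured output
}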

Fixing tracing

I think I have done it right this time. There is one known divergence from Tokio (listed below); there might be more, but I think the general scheme should work. This also adds a fix for spans stacking on test panic inside a catch_unwind (read: proptest minimization).

Overview of the changes

  • New config option, record_steps_in_span.
    • The step{i=X, task=Y} span is now step{task=Y} by default, and step{task=Y, i=X} if record_steps_in_span is true
  • Updating the tracing span is no longer done in advance_to_next_task. It is instead done in step, by entering a Task's span before running its continuation, and leaving the span when we return from the continuation.
  • Added ResetSpanOnDrop, which is constructed in run; a sketch of the pattern follows this list. This solves the issue of span stacking on panics inside a catch_unwind (read: proptest minimization). I found this way of solving it much cleaner than finding all the places we could exit a Shuttle test and making sure to clean up spans there.
  • A couple of traces which were previously emitted in whatever span was present (i.e. the span of a Task) are now emitted in (an extension of) Shuttle's span.
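
For readers unfamiliar with the pattern: ResetSpanOnDrop is an RAII guard. The sketch below shows only the shape, under the assumption that exiting a span through the Dispatch pops the subscriber's span stack; it is not the PR's actual implementation, and the names are illustrative:

use tracing::{dispatcher, Span};

// Remembers the span that was current when the guard was created, and
// on drop exits any spans stacked on top of it. Because drop also runs
// during panic unwinding, this cleans up spans that a panicking test
// entered but never exited (e.g. under catch_unwind).
struct ResetSpanOnDrop {
    span_on_entry: Span,
}

impl ResetSpanOnDrop {
    fn new() -> Self {
        Self { span_on_entry: Span::current() }
    }
}

impl Drop for ResetSpanOnDrop {
    fn drop(&mut self) {
        dispatcher::get_default(|dispatch| {
            // Pop spans until the one stored at construction is current again.
            while Span::current().id() != self.span_on_entry.id() {
                match Span::current().id() {
                    Some(id) => dispatch.exit(&id),
                    None => break,
                }
            }
        });
    }
}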

Some notes

  • I mentioned that I considered a "default to no span" approach, meaning that Shuttle would exit whatever span was present on test entry, store it, and then follow an explicit enter/leave policy for all tracing, thus avoiding all accidental "span-bleed". I decided against this because ensuring that the correct span was entered, for instance in thread_fn, was cumbersome and verbose, and thus more error-prone than using whatever span is present. This means there may be traces emitted under a span other than the one we would prefer (as I have not looked into every single Event emission); these can be handled as we find them.
  • The in_scope calls in spawn_thread and Task::new() are a matter of taste: I prefer for these to be emitted under Shuttle's span, but they are reactions to actions taken by the function under test, so there is an argument for emitting them under whatever span is currently entered.

Known issues / discrepancies

Below are some tests which do the same thing using Shuttle sync/future threads, std::sync threads, and tokio::spawn tasks. Shuttle sync/future and std::sync behave the same; tokio::spawn does something different. I think what Shuttle currently does is sane enough (arguably better than the mess you may get when running under Tokio; this is not how to combine futures and spans).

Tokio

async fn awaiting_is_relevant() {
    let lock = Arc::new(Mutex::new(0));
    let threads: Vec<_> = (0..3).map(|i| {
        let lock = lock.clone();
        tokio::spawn(async move {
            let outer = warn_span!("outer", tid=i);
            let _outer = outer.enter();

            let lock_clone = lock.clone();
            let t = tokio::spawn(async move {
                let mut locked = lock_clone.lock().unwrap();
                let inner = warn_span!("inner1", tid=i);
                let _inner = inner.enter();
                error!("incrementing from {}", *locked);
                *locked += 1;
            });

            t.await.unwrap();

            let _t = tokio::spawn(async move {
                let mut locked = lock.lock().unwrap();
                let inner = warn_span!("inner2", tid=i);
                let _inner = inner.enter();
                error!("incrementing from {}", *locked);
                *locked += 1;
            });

        })
    }).collect();

    for thread in threads {
        thread.await.unwrap();
    }
}

#[tokio::test]
async fn tokio_awaiting_is_relevant() {
    let _ = tracing_subscriber::fmt::try_init();
    awaiting_is_relevant().await;
}

Running RUST_LOG=warn cargo test tokio_awaiting gives:

running 1 test
2023-11-27T16:33:30.579493Z ERROR outer{tid=0}:outer{tid=1}:outer{tid=2}:inner1{tid=0}: play: incrementing from 0
2023-11-27T16:33:30.579761Z ERROR outer{tid=0}:outer{tid=1}:outer{tid=2}:inner1{tid=1}: play: incrementing from 1
2023-11-27T16:33:30.579794Z ERROR outer{tid=0}:outer{tid=1}:outer{tid=2}:inner1{tid=2}: play: incrementing from 2
2023-11-27T16:33:30.579850Z ERROR inner2{tid=0}: play: incrementing from 3
2023-11-27T16:33:30.579869Z ERROR inner2{tid=1}: play: incrementing from 4
2023-11-27T16:33:30.579885Z ERROR inner2{tid=2}: play: incrementing from 5
test tokio_awaiting_is_relevant ... ok

As we can see, doing t.await.unwrap(); for the first t makes it execute under the parent spans, while not awaiting the second t leaves its events with no parent.
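
(For reference, the documented way to give a spawned task a parent span is to attach the span with tracing's Instrument combinator instead of holding an enter() guard across the spawn. A minimal sketch, reusing the names from the example above:)

use tracing::Instrument;

let span = warn_span!("inner1", tid = i);
let t = tokio::spawn(
    async move {
        // Every poll of this future runs inside `span`, regardless of
        // which worker thread polls it or whether anyone awaits `t`.
    }
    .instrument(span),
);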

Shuttle future::spawn

async fn shuttle_awaiting_is_irrelevant() {
    let lock = Arc::new(Mutex::new(0));
    let threads: Vec<_> = (0..3)
        .map(|i| {
            let lock = lock.clone();
            shuttle::future::spawn(async move {
                let outer = warn_span!("outer", tid = i);
                let _outer = outer.enter();

                let lock_clone = lock.clone();
                let t = shuttle::future::spawn(async move {
                    let mut locked = lock_clone.lock().unwrap();
                    let inner = warn_span!("inner1", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });

                t.await.unwrap();

                let _t = shuttle::future::spawn(async move {
                    let mut locked = lock.lock().unwrap();
                    let inner = warn_span!("inner2", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });
            })
        })
        .collect();

    for thread in threads {
        thread.await.unwrap();
    }
}

#[test]
fn shuttle_test_awaiting_is_irrelevant() {
    let _ = tracing_subscriber::fmt::try_init();
    shuttle::check_random(
        move || {
            shuttle::future::block_on(async {
                shuttle_awaiting_is_irrelevant().await;
            })
        },
        1,
    );
}

Running RUST_LOG=warn cargo test shuttle_test_awaiting gives:

running 1 test
2023-11-27T16:45:55.192800Z ERROR inner1{tid=0}: r#mod::basic::tracing: incrementing from 0
2023-11-27T16:45:55.194048Z ERROR inner1{tid=1}: r#mod::basic::tracing: incrementing from 1
2023-11-27T16:45:55.199122Z ERROR inner2{tid=0}: r#mod::basic::tracing: incrementing from 2
2023-11-27T16:45:55.200569Z ERROR inner2{tid=1}: r#mod::basic::tracing: incrementing from 3
2023-11-27T16:45:55.200632Z ERROR inner1{tid=2}: r#mod::basic::tracing: incrementing from 4
2023-11-27T16:45:55.200699Z ERROR inner2{tid=2}: r#mod::basic::tracing: incrementing from 5
test basic::tracing::shuttle_test_awaiting_is_irrelevant ... ok

As we can see, this differs from what we saw when running under Tokio.

No runtime std::thread::spawn

#[test]
fn std_thread_joining_is_irrelevant() {
    let _ = tracing_subscriber::fmt::try_init();
    let lock = Arc::new(std::sync::Mutex::new(0));
    let threads: Vec<_> = (0..3)
        .map(|i| {
            let lock = lock.clone();
            std::thread::spawn(move || {
                let outer = warn_span!("outer", tid = i);
                let _outer = outer.enter();

                let lock_clone = lock.clone();
                let t = std::thread::spawn(move || {
                    let mut locked = lock_clone.lock().unwrap();
                    let inner = warn_span!("inner1", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });

                t.join().unwrap();

                let _t = std::thread::spawn(move || {
                    let mut locked = lock.lock().unwrap();
                    let inner = warn_span!("inner2", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });
            })
        })
        .collect();

    for thread in threads {
        thread.join().unwrap();
    }
}

Running RUST_LOG=warn cargo test std_thread -- --nocapture gives:

running 1 test
2023-11-27T17:38:49.695078Z ERROR inner1{tid=0}: r#mod::basic::tracing: incrementing from 0
2023-11-27T17:38:49.695650Z ERROR inner1{tid=1}: r#mod::basic::tracing: incrementing from 1
2023-11-27T17:38:49.696859Z ERROR inner1{tid=2}: r#mod::basic::tracing: incrementing from 2
2023-11-27T17:38:49.697743Z ERROR inner2{tid=0}: r#mod::basic::tracing: incrementing from 3
test basic::tracing::std_thread_joining_is_irrelevant ... ok
2023-11-27T17:38:49.697917Z ERROR inner2{tid=1}: r#mod::basic::tracing: incrementing from 4

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 

As we can see, the same as above.

Shuttle thread::spawn

fn shuttle_joining_is_irrelevant() {
    let lock = Arc::new(Mutex::new(0));
    let threads: Vec<_> = (0..3)
        .map(|i| {
            let lock = lock.clone();
            thread::spawn(move || {
                let outer = warn_span!("outer", tid = i);
                let _outer = outer.enter();

                let lock_clone = lock.clone();
                let t = thread::spawn(move || {
                    let mut locked = lock_clone.lock().unwrap();
                    let inner = warn_span!("inner1", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });

                t.join().unwrap();

                let _t = thread::spawn(move || {
                    let mut locked = lock.lock().unwrap();
                    let inner = warn_span!("inner2", tid = i);
                    let _inner = inner.enter();
                    error!("incrementing from {}", *locked);
                    *locked += 1;
                });
            })
        })
        .collect();

    for thread in threads {
        thread.join().unwrap();
    }
}

#[test]
fn shuttle_test_joining_is_irrelevant() {
    let _ = tracing_subscriber::fmt::try_init();
    shuttle::check_random(shuttle_joining_is_irrelevant, 1);
}

Running RUST_LOG=warn cargo test shuttle_test_join -- --nocapture gives:

running 1 test
2023-11-27T17:40:09.863872Z ERROR inner1{tid=0}: r#mod::basic::tracing: incrementing from 0
2023-11-27T17:40:09.876532Z ERROR inner2{tid=0}: r#mod::basic::tracing: incrementing from 1
2023-11-27T17:40:09.878446Z ERROR inner1{tid=2}: r#mod::basic::tracing: incrementing from 2
2023-11-27T17:40:09.878521Z ERROR inner1{tid=1}: r#mod::basic::tracing: incrementing from 3
2023-11-27T17:40:09.878573Z ERROR inner2{tid=2}: r#mod::basic::tracing: incrementing from 4
2023-11-27T17:40:09.878639Z ERROR inner2{tid=1}: r#mod::basic::tracing: incrementing from 5
test basic::tracing::shuttle_test_joining_is_irrelevant ... ok

As we can see, the same as the two above.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sarsko (Contributor Author) commented Nov 28, 2023

Issue when running with RUST_LOG=info:

tried to clone Id(195906583790617176), but no span exists with that ID
This may be caused by consuming a parent span (`parent: span`) rather than borrowing it (`parent: &span`).
stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: <tracing_subscriber::registry::sharded::Registry as tracing_core::subscriber::Subscriber>::clone_span
   3: <tracing_subscriber::fmt::Subscriber<N,E,F,W> as tracing_core::subscriber::Subscriber>::enter
   4: amzn_shuttle_fork::runtime::execution::Execution::step::{{closure}}::{{closure}}
   5: amzn_shuttle_fork::runtime::execution::Execution::step
   6: scoped_tls::ScopedKey<T>::set
   7: amzn_shuttle_fork::runtime::execution::Execution::run
   8: tracing::span::Span::in_scope
   9: scoped_tls::ScopedKey<T>::set
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@sarsko (Contributor Author) commented Nov 28, 2023

There is a question of whether the step and execution spans should be at the ERROR level so that they are always included. Having that information is pretty much always useful.

@jorajeev (Member) left a comment

Thanks for the CR - this will make debugging a lot easier, especially when using Shuttle with proptesting. And the *_awaiting_is_irrelevant examples are really helpful!

src/lib.rs Outdated
/// `tracing_subscriber::fmt`, appends to the span on calls to `record()` (instead of
/// overwriting), which results in traces which are hard to read if the task is scheduled more
/// than a few times.
/// Thus: set `record_steps_in_span` to `true` if you want this behaviour, or if you are using
jorajeev (Member): By this behavior you mean the append behavior, right? Could you make that clear?

jorajeev (Member): Also: do we really need to provide this option? What other format subscribers are there for which we'd want to set this to true?

@sarsko (Contributor Author) replied Dec 7, 2023:

> By this behavior you mean append behavior right? Could you make that clear?

By "this behaviour" I mean that we call record() on the step_span. In tracing_subscriber::fmt this maps to appending. In The Nice Subscriber this maps to overwriting.

> Also: do we really need to provide this option? What other format subscribers are there for which we'd want to set this to true?

Having the step logged is super useful. I think this will be set to true for all my tests. I'll make a tracing_subscriber::fmt::Layer which handles this.
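
(For context, enabling the new option would look roughly like this. The record_steps_in_span field is the one added in this PR; the rest follows Shuttle's existing Config/Runner API, and the test body is hypothetical:)

use shuttle::scheduler::RandomScheduler;
use shuttle::{Config, Runner};

let mut config = Config::new();
config.record_steps_in_span = true; // step{task=Y, i=X} instead of step{task=Y}

let scheduler = RandomScheduler::new(100); // 100 iterations
Runner::new(scheduler, config).run(|| {
    // test body under Shuttle
});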

if let Some(span_id) = current_span.id().as_ref() {
    subscriber.exit(span_id);
}
state.current_mut().span = current_span;
jorajeev (Member): This makes me wonder if the field should be renamed from span to last_span?

@sarsko (Contributor Author): Not super important what it is named, but why? The span hasn't changed; we have just swapped out of the task. The span is the Span of the task, just as the continuation is its continuation.

jorajeev (Member), on the new tests: Since these tests are not checking anything, maybe mark them ignored for now?

@sarsko (Contributor Author): Yes. I'd like to add a Subscriber which checks them. Can ignore for now, then add that later (I'll open an issue).

@sarsko (Contributor Author) commented Dec 13, 2023

Never mind: I did not look into the non-ERROR/WARN traces sufficiently. There is span-stacking on instrumented futures.

@sarsko (Contributor Author) commented Jan 6, 2024

Interestingly, the issues with instrumented futures don't show up when running with tracing 0.2.0 (the current pre-release version).

tracing::info_span!(parent: parent_span.id(), "step", i = schedule_len, task = id.0)
};
let step_span = error_span!(parent: parent_span_id.clone(), "step", task = id.0, i = field::Empty);
// Note that this is slightly lazy — we are starting storing at the step_span, but could have gotten the
jorajeev (Member): Hm, if we know that storing the full stack would be better, why are we cutting corners here? Why not just do the right thing?

@sarsko (Contributor Author): Because I don't think it is better / I don't believe this is cutting corners. This comment and the other one of a similar nature are meant as the following: "I really believe the current scheme is correct. However, spans are both weird and tricky. If there ever are any issues, then start by doing the changes in the comments, in order to definitely rule them out." In other words: I think the current way of doing it is "just right", and that the suggestions are "too much". I'd rather do the "just right" approach, both for overall code cleanness and for the opportunity to be proven wrong (and thus learn more about Spans), but leave the comments in case I am wrong.

// All of the tests testing tracing will either have to have weak assertions,
// or be #[ignore]d. The reason for this is that they are not thread safe
// (since everything tracing is managed globally), and there is no way to
// ensure that the tests are run single-threaded.
jorajeev (Member): There is this crate: https://docs.rs/serial_test/latest/serial_test/index.html. See if it's worth using here?

@sarsko (Contributor Author): Will do, thanks!

@sarsko (Contributor Author): It seems to me that it won't solve the issue unless all other tests are marked with parallel, which isn't worth it. This is because marking a test with serial only guarantees that it won't be run at the same time as other serial or parallel tests; it can still run at the same time as tests which are not marked.
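
(For reference, serial_test works by tagging individual tests, which is why it only serializes tests that opt in. A minimal sketch of the suggested usage, with hypothetical test names:)

use serial_test::serial;

#[test]
#[serial] // mutually exclusive with other #[serial] tests, but not with unmarked ones
fn tracing_test_a() { /* ... */ }

#[test]
#[serial]
fn tracing_test_b() { /* ... */ }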

@jorajeev merged commit d86479b into awslabs:main Jan 31, 2024
5 checks passed
jorajeev pushed a commit that referenced this pull request Feb 29, 2024
* Tracing fix

* Own Span in Task to avoid dropping the Span while Task is switched out

* Respond to comments by @jorajeev

- current_task -> parent_task_id
- span -> top_level_span
- parent_id -> parent_span_id
- #[ignore] tracing tests
- move test_tracing_nested_spans up

* Make the execution_span and the step_span ERROR

* Add instrumented futures test

* Fix issue with tracing and instrumented futures

* Update instrumented futures tracing test to highlight top level span failures

* Clippy unneeded async

* Update ResetSpanOnDrop to store an owned copy of the Span

* Address comments by @jorajeev